accelerating training
Ouroboros: On Accelerating Training of Transformer-Based Language Models
Language models are essential for natural language processing (NLP) tasks, such as machine translation and text summarization. Remarkable performance has been demonstrated recently across many NLP domains via a Transformer-based language model with over a billion parameters, verifying the benefits of model size. Model parallelism is required if a model is too large to fit in a single computing device. Current methods for model parallelism either suffer from backward locking in backpropagation or are not applicable to language models. We propose the first model-parallel algorithm that speeds the training of Transformer-based language models. We also prove that our proposed algorithm is guaranteed to converge to critical points for non-convex problems. Extensive experiments on Transformer and Transformer-XL language models demonstrate that the proposed algorithm obtains a much faster speedup beyond data parallelism, with comparable or better accuracy.
Review for NeurIPS paper: Accelerating Training of Transformer-Based Language Models with Progressive Layer Dropping
Summary and Contributions: This paper proposes to accelerate the training of Transformer networks by progressively dropping Transformer layers from the network during training. First, it compares two BERT architectures, PostLN and PreLN. PostLN applies layer normalization after the element-wise addition in each Transformer block. PreLN instead places layer normalization only on the input stream of the sublayers. The paper finds that PostLN is more sensitive to the choice of hyperparameters and that its training often diverges at more aggressive learning rates, whereas PreLN avoids vanishing gradients and leads to more stable optimization.
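The difference between the two placements described in this review can be illustrated with a minimal sketch. It is not the paper's implementation: `layer_norm` omits the learnable gain and bias, vectors are plain Python lists, and `sublayer` is a hypothetical stand-in for an attention or feed-forward sublayer.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learnable gain/bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def post_ln_block(x, sublayer):
    """PostLN: layer normalization is applied after the residual addition."""
    added = [a + b for a, b in zip(x, sublayer(x))]
    return layer_norm(added)

def pre_ln_block(x, sublayer):
    """PreLN: layer normalization is applied only to the sublayer input;
    the residual stream itself is passed through unnormalized."""
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]

# Hypothetical sublayer that outputs zeros, to expose the residual path.
zero_sublayer = lambda h: [0.0] * len(h)
x = [1.0, 2.0, 3.0, 4.0]
pre_out = pre_ln_block(x, zero_sublayer)    # residual path is an exact identity
post_out = post_ln_block(x, zero_sublayer)  # output is renormalized
```

With a zero sublayer, the PreLN block returns its input unchanged while the PostLN block rescales it, which is one way to see why the PreLN residual path stays an identity mapping.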
Review for NeurIPS paper: Accelerating Training of Transformer-Based Language Models with Progressive Layer Dropping
The proposed method for training BERT is practically useful. My main concern is that the novelty of this paper is somewhat limited, since it combines two existing techniques. One is PreLN, which has been well studied in the literature for training BERT; the other is stochastic layer dropping, which was first proposed for training CV models. On the other hand, combining these two techniques effectively and tuning them to work for BERT training does take some effort.
Reviews: Ouroboros: On Accelerating Training of Transformer-Based Language Models
The paper introduces a new method for model-parallel training, where layers of a model are distributed across multiple accelerators. The method avoids locking in the backward pass by using stale gradients during back-propagation. I'm not aware of any prior work that took such an approach. Furthermore, the authors provide theoretical claims and empirical results to demonstrate that their method has convergence properties similar to conventional SGD, despite using stale gradients. The lack of effective model-parallel training is a major roadblock for scaling up model sizes, and the proposed approach promises to overcome this issue.
Reviews: Ouroboros: On Accelerating Training of Transformer-Based Language Models
This paper studies the problem of parallelising large Transformer-based language models. It goes beyond data parallelism in that it focuses on splitting the model when it does not fit in the memory of a single GPU. The idea is to segment the model into groups such that GPUs do not sit idle waiting for others to pass gradients (as happens in layer-wise parallel solutions where each layer is on its own GPU). The method then allows backpropagation to use stale gradients between groups. An L-layer network is split into K modules so that the weights of the network are divided into K groups, and each group is placed on a GPU.
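The split of an L-layer network into K modules can be sketched as follows. The review does not say how the layers are balanced across modules, so the contiguous, near-even split here is an assumption, and `partition_layers` is a hypothetical helper name.

```python
def partition_layers(num_layers, num_modules):
    """Split layer indices 0..num_layers-1 into num_modules contiguous groups.

    Assumption: groups are as even as possible, with earlier groups taking
    one extra layer when num_layers is not divisible by num_modules.
    """
    if not 1 <= num_modules <= num_layers:
        raise ValueError("need 1 <= num_modules <= num_layers")
    base, extra = divmod(num_layers, num_modules)
    groups, start = [], 0
    for k in range(num_modules):
        size = base + (1 if k < extra else 0)
        groups.append(list(range(start, start + size)))
        start += size
    return groups

# Each group would be placed on its own GPU (device k holds groups[k]).
groups = partition_layers(12, 4)
```

For 12 layers and 4 modules this yields four groups of three consecutive layers each.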
Accelerating Training of Transformer-Based Language Models with Progressive Layer Dropping
Recently, Transformer-based language models have demonstrated remarkable performance across many NLP domains. However, the unsupervised pre-training step of these models incurs very high overall computational expense. Current methods for accelerating the pre-training either rely on massive parallelism with advanced hardware or are not applicable to language models. In this work, we propose a method based on progressive layer dropping that speeds the training of Transformer-based language models, not at the cost of excessive hardware resources but through efficiency gains from a change in model architecture and training technique. Extensive experiments on BERT show that the proposed method achieves a 25% reduction of computation cost in FLOPs and a 24% reduction in the end-to-end wall-clock training time. Furthermore, we show that our pre-trained models are equipped with strong knowledge transferability, achieving similar or even higher accuracy in downstream tasks compared to baseline models.
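As a rough illustration of layer dropping during training, the sketch below skips each layer at random with a keep probability that decreases over training. The abstract does not give the schedule, so the linear schedule, the `final_keep=0.5` default, and both function names are assumptions for illustration only, not the paper's method.

```python
import random

def keep_probability(step, total_steps, final_keep=0.5):
    """Hypothetical linear schedule: keep probability falls from 1.0 to final_keep."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return 1.0 - frac * (1.0 - final_keep)

def forward_with_layer_drop(x, layers, keep_prob, rng):
    """Apply each layer with probability keep_prob; otherwise skip it.

    Returns the output and the number of layers actually executed.
    """
    executed = 0
    for layer in layers:
        if rng.random() < keep_prob:
            x = layer(x)
            executed += 1
    return x, executed

# Toy "layers" that each add 1 to a scalar, standing in for Transformer blocks.
layers = [lambda v: v + 1 for _ in range(12)]
rng = random.Random(0)
out_full, ran_full = forward_with_layer_drop(0, layers, keep_probability(0, 100), rng)
out_none, ran_none = forward_with_layer_drop(0, layers, 0.0, rng)
```

Skipped layers cost no compute in the forward or backward pass, which is where a reduction in training FLOPs would come from under this kind of scheme.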
Ouroboros: On Accelerating Training of Transformer-Based Language Models
Yang, Qian, Huo, Zhouyuan, Wang, Wenlin, Carin, Lawrence
Language models are essential for natural language processing (NLP) tasks, such as machine translation and text summarization. Remarkable performance has been demonstrated recently across many NLP domains via a Transformer-based language model with over a billion parameters, verifying the benefits of model size. Model parallelism is required if a model is too large to fit in a single computing device. Current methods for model parallelism either suffer from backward locking in backpropagation or are not applicable to language models. We propose the first model-parallel algorithm that speeds the training of Transformer-based language models.
Accelerating Training for AI Deep Learning Networks with "Chunking" - insideBIGDATA
At the International Conference on Learning Representations on May 6, IBM Research will share a deeper look into how chunk-based accumulation can speed the training of deep learning networks used for artificial intelligence (AI). The company first shared the concept and its vast potential at last year's NeurIPS conference, when it demonstrated the ability to train deep learning models with 8-bit precision while fully preserving model accuracy across all major AI data set categories: image, speech and text. This technique could accelerate training time for deep neural networks by two to four times over today's 16-bit systems. In IBM Research's new paper, titled "Accumulation Bit-Width Scaling For Ultralow Precision Training of Deep Networks," researchers explain in greater depth exactly how the concept of chunk-based accumulation works to lower the precision of accumulation from 32 bits down to 16 bits. "Chunking" takes the product and divides it into smaller groups of accumulation and then adds the result of each of these smaller groups together, leading to a significantly more accurate result than that of normal accumulation.
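The effect of chunking on a low-precision accumulator can be reproduced with a small sketch. It simulates 16-bit accumulation in pure Python by rounding through the standard `struct` half-precision format; the chunk size of 64 and the helper names are assumptions for illustration, not details taken from the IBM paper.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def naive_sum_fp16(values):
    """Accumulate one element at a time, rounding the accumulator to fp16."""
    acc = 0.0
    for v in values:
        acc = to_fp16(acc + to_fp16(v))
    return acc

def chunked_sum_fp16(values, chunk_size=64):
    """Sum each chunk in fp16, then sum the per-chunk partial results in fp16."""
    partials = [
        naive_sum_fp16(values[i:i + chunk_size])
        for i in range(0, len(values), chunk_size)
    ]
    return naive_sum_fp16(partials)

values = [1.0] * 4096
naive = naive_sum_fp16(values)      # stalls once the accumulator reaches 2048
chunked = chunked_sum_fp16(values)  # recovers the exact sum, 4096
```

The naive accumulator stops growing at 2048 because adding 1.0 to it falls below half-precision resolution, while the chunked version keeps every intermediate sum small enough to be represented exactly.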
Accelerating Training of Deep Neural Networks with a Standardization Loss
Collins, Jasmine, Balle, Johannes, Shlens, Jonathon
A significant advance in accelerating neural network training has been the development of normalization methods, permitting the training of deep models both faster and with better accuracy. These advances come with practical challenges: for instance, batch normalization ties the prediction of individual examples with other examples within a batch, resulting in a network that is heavily dependent on batch size. Layer normalization and group normalization are data-dependent and thus must be continually used, even at test-time. To address the issues that arise from using explicit normalization techniques, we propose to replace existing normalization methods with a simple, secondary objective loss that we term a standardization loss. This formulation is flexible and robust across different batch sizes and surprisingly, this secondary objective accelerates learning on the primary training objective. Because it is a training loss, it is simply removed at test-time, and no further effort is needed to maintain normalized activations. We find that a standardization loss accelerates training on both small- and large-scale image classification experiments, works with a variety of architectures, and is largely robust to training across different batch sizes.